Multimedia event detection (MED) is a challenging problem because of the heterogeneous content and variable quality found in large collections of Internet videos. To study the value of multimedia features and fusion for representing and learning events from a set of example video clips, we created SESAME, a system for video SEarch with Speed and Accuracy for Multimedia Events. SESAME includes multiple bag-of-words event classifiers based on single data types: low-level visual, motion, and audio features; high-level semantic visual concepts; and automatic speech recognition. Event detection performance was evaluated for each event classifier. The performance of low-level visual and motion features was improved by the use of difference coding. The accuracy of the visual concepts was nearly as strong as that of the low-level visual features. Experiments with a number of fusion methods for combining the event detection scores from these classifiers revealed that simple fusion methods, such as arithmetic mean, perform as well as or better than other, more complex fusion methods. SESAME's performance in the 2012 TRECVID MED evaluation was one of the best reported.
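The late-fusion result mentioned above can be illustrated with a minimal sketch: each single-modality classifier produces a detection score per video, the scores are normalized to a common range, and the fused score is their arithmetic mean. The score values, modality labels, and min-max normalization choice below are illustrative assumptions, not details taken from the SESAME system itself.

```python
import numpy as np

def minmax_normalize(scores: np.ndarray) -> np.ndarray:
    """Rescale one classifier's scores to [0, 1] so modalities are comparable."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)

def arithmetic_mean_fusion(score_matrix: np.ndarray) -> np.ndarray:
    """Fuse per-classifier scores (rows = classifiers, columns = videos)."""
    normalized = np.vstack([minmax_normalize(row) for row in score_matrix])
    return normalized.mean(axis=0)

if __name__ == "__main__":
    # Hypothetical raw detection scores from three single-modality classifiers
    # (e.g., low-level visual, motion, audio) over five test videos.
    raw_scores = np.array([
        [0.10, 0.80, 0.40, 0.95, 0.20],  # visual
        [0.30, 0.60, 0.50, 0.90, 0.10],  # motion
        [0.05, 0.70, 0.20, 0.85, 0.15],  # audio
    ])
    fused = arithmetic_mean_fusion(raw_scores)
    print("Fused event-detection scores:", np.round(fused, 3))
```

The appeal of this design is that it requires no training of a fusion model and is robust when individual modalities vary widely in quality, which is consistent with the finding that it matched or outperformed more complex fusion methods.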